Abstract. Bibliometrics has the ambitious goal of measuring science. 
To this end, it exploits the way science is disseminated through scientific 
publications and the resulting citation network of scientific papers. We 
survey the main historical contributions to the field, the most interesting 
bibliometric indicators, and the most popular bibliometric data sources. 
Moreover, we discuss distributions commonly used to model bibliometric 
phenomena and give an overview of methods to build bibliometric maps 
of science. 



1 Introduction 

Bibliometrics is a research method used in library and information science. It 
uses quantitative analysis and statistics in order to: 

— determine the influence of single scholars or groups of them (e.g., research 
groups, institutions, countries) and that of single papers or groups of them 
(e.g., journals or entire research fields); 

— describe the relationships between authors, publications, journals, or re- 
search fields. 

Bibliometrics has become a standard tool of science policy and research man- 
agement in the last decades. Academic institutions increasingly rely on biblio- 
metric analysis for making decisions regarding hiring, promotion, tenure, and 
funding of scholars; authors, librarians, and publishers may use citation indica- 
tors to evaluate journals and to select those of high impact; editors may choose 
reviewers on the basis of their bibliometric scores on a particular subject of interest; 
worldwide college and university rankings, e.g., THE-QS and ARWU, 
which are partially based on bibliometric criteria, are often consulted by prospec- 
tive students and their parents in the college and university admissions process. 
Today, bibliometrics is one of the rare truly interdisciplinary research fields, with 
important links with history of science, sociology, law, economics, management, 
theology, mathematics, statistics, physics, and computer science. 

Citation analysts retrieve production and citation data from bibliometric 
data sources and compute performance indicators to measure the quality of 
research of an actor. The bibliographic databases of the Institute for Scientific 
Information, today Thomson-Reuters, have been used for decades as a starting 
point and often as the only tools for locating citations and conducting citation 
analysis. Fierce competitors of the databases provided by Thomson-Reuters are 
Elsevier's Scopus and the freely accessible Google Scholar. 

Actors under evaluation are typically individual scholars and journals, but 
bibliometric units can be composed of homogeneous groups of scholars or groups 
of journals at different levels of aggregation. Bibliometric criteria that charac- 
terize research quality are productivity, impact (or popularity), and prestige. 
Typically, each bibliometric indicator captures only some of these criteria; hence 
the evaluation process needs several orthogonal metrics that capture 
different aspects of research performance. 

The outline of this manuscript is as follows. We first briefly review the main 
historical contributions to the field in Section 2. In Section 3 we discuss the 
controversial role of citations in bibliometrics. Section 4 surveys the most 
interesting bibliometric indicators both at the individual and at the journal level, 
while Section 5 is devoted to the comparison of the most popular bibliometric 
data sources. Section 6 investigates the probability distributions that underlie 
most phenomena in bibliometrics. In Section 7 we delve into the realm of 
bibliometric maps of science. Finally, Section 8 contains some of the best quotations 
about bibliometrics. 

2 Historical remarks 

Bibliometric studies started a long time ago. A remarkable early piece of work is 
Histoire des sciences et des savants depuis deux siècles. The author, Alphonse 
de Candolle, describes the scientific strength of nations and tries to find 
environmental factors for the scientific success of a nation [1]. 

Derek John de Solla Price (1922 - 1983), a historian of science and 
information scientist born to Philip Price, a tailor, and Fanny de Solla, a singer, is 
credited as the father of bibliometrics. In his book Little Science, Big Science, 
he analyzed the recent system of science communication and laid the foundation 
of modern research evaluation techniques [2]. 

The term bibliometrics was introduced by Pritchard in 1969 [3]. Pritchard 
explains the term bibliometrics as: 

the application of mathematical and statistical methods to books and other 
media of communication 

At the same time, Nalimov and Mulchenko defined scientometrics as: 

the application of those quantitative methods which are dealing with the 
analysis of science viewed as an information process 

According to these definitions, scientometrics is restricted to science 
communication, whereas bibliometrics is designed to deal with more general information 
processes. Nowadays, the borderlines between the two specialities have almost 
vanished and the two terms are used almost as synonyms. 

The statistical analysis of scientific literature began years before the term 
bibliometrics was coined. The main contributions are: Lotka's Law of scientific 
productivity, Bradford's Law of scatter, and Zipf's Law of word occurrence. In 
1926, Alfred J. Lotka published a study on the frequency distribution of scientific 
productivity determined from a decennial index of Chemical Abstracts [4] (see 
Table 1). Lotka concluded that: 

In a given field, the number of authors making n contributions is about 
1/n^2 of those making one. 

Lotka's Law means that few authors contribute most of the papers and many 
or most of them contribute few publications. For instance, in the original data 
of Lotka's study illustrated in Table 1, the most prolific 1350 authors (21% of 
the total) wrote more than half of the papers (6429 papers, 51% of the total). 

Table 1. Original Lotka data and fit on the basis of Lotka distribution (up to 
10 papers) 
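
Lotka's inverse-square rule can be illustrated with a short computation. The following is a minimal sketch, not Lotka's original fitting procedure: the function name and the figure of 1000 single-paper authors are hypothetical, and the exponent is exposed as a parameter because empirical fits need not be exactly 2.

```python
def lotka_expected_authors(single_paper_authors, n, exponent=2.0):
    """Expected number of authors with exactly n papers under Lotka's Law.

    single_paper_authors: number of authors who published exactly one paper.
    exponent: 2 in Lotka's original statement (assumed default).
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    return single_paper_authors / n ** exponent


# Hypothetical field in which 1000 authors published exactly one paper.
for n in range(1, 6):
    print(n, lotka_expected_authors(1000, n))
```

Under these assumptions, one quarter as many authors write two papers as write one, one ninth as many write three, and so on.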



Eight years after Lotka's article appeared, Bradford published his study on 
the frequency distribution of papers over journals [5]. It states that: 

If scientific journals are arranged in order of decreasing productivity on 
a given subject, they can be divided into groups of different sizes (number 
of journals) each containing the same number of papers relevant to the 
subject. The size of each group (except the first) is given by the size of 
the previous group multiplied by a constant. 

Bradford formulated his law after studying a bibliography of geophysics 
(Table 2). Journals can be divided into 3 groups of different sizes but containing about 
the same number of relevant papers: a core group, columns 1 and 2 of the table, 
containing 2 journals and 179 relevant papers; a second group, columns 3-6, 
containing 4 journals and 185 relevant papers; and a third group, columns 7-17, 
containing 11 journals and 186 relevant papers. 

Table 2. Original Bradford data for subject geophysics. Bradford arranged 
journals in order of decreasing productivity on the subject and counted the number 
of papers in the journal that are relevant to geophysics. 



This means that for each speciality there are a few core journals that 
contribute a relatively large share of the publications, while many 
or most of the journals contribute few. Hence, it is sufficient to identify 
the core journals for that field and look at them. Very rarely will researchers 
need to go outside that set. This may serve, for example, to guide librarians 
in choosing the core journals to stock in any given field. Bradford's Law also 
led to the discovery, which some did not expect, that a few journals like Nature 
and Science were core for all of hard science. The same pattern does not occur 
in the humanities or the social sciences, possibly because objective truth is 
so much harder to establish there. The result is pressure on scientists to 
publish in the best journals, and pressure on universities to ensure access to that 
core set of journals. 

Zipf formulated an interesting law in bibliometrics and quantitative linguis- 
tics that he derived from the study of word frequency in texts [6]. Zipf's Law 
states that: 

In relatively lengthy texts, if words occurring within the text are listed 
in order of decreasing frequency, then the rank of a word on that list 
multiplied by its frequency will equal a constant, which depends on the 
analyzed text. 
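
The rank-frequency relation stated above can be checked on any text with a few lines of code. This is only a sketch: the function name is hypothetical, and the tokenization (lowercased runs of letters and apostrophes) is a simplifying assumption rather than Zipf's own counting procedure.

```python
import re
from collections import Counter


def rank_frequency_products(text, top=10):
    """Return (rank, word, frequency, rank * frequency) for the most frequent words."""
    # Assumed tokenization: lowercase runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    ranked = Counter(words).most_common(top)
    return [
        (rank, word, freq, rank * freq)
        for rank, (word, freq) in enumerate(ranked, start=1)
    ]


# Toy input; a lengthy text is needed for the law to become visible.
for row in rank_frequency_products("a a a b b c"):
    print(row)
```

If a text conforms to Zipf's Law, the last element of each tuple stays roughly constant as the rank grows.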

Zipf illustrated his law with an analysis of James Joyce's Ulysses (Table 3). It 
means that only a few words are used very often, many or most are used rarely. 
Why natural language texts conform to Zipfian distribution has been a matter 
of some controversy. Zipf explains his law with the Principle of Least Effort, 
defined as follows: 

Each individual will adopt a course of action that will involve the expen- 
diture of the probably least average of his work. 

According to Zipf, if the Principle of Least Effort works, the speaker (or 
writer) tends to minimize number and length of words (this he calls the Force of 
Unification), by overloading the same word with different meanings, while the 
hearer (or reader) calls for a diversification of words (this he calls the Force of 
Diversification), by assigning different meanings to different words. For 
communication to be effective, these opposite forces need to equilibrate, giving rise to 
the mentioned law of word occurrence. 

Table 3. The distribution of words in Joyce's Ulysses 

Only at the beginning of the 1980s, with the fast development of computer 
science, could bibliometrics evolve into a distinct scientific discipline with a 
specific research profile and corresponding communication structures. Scientometrics, 
the first international periodical specialized in bibliometric topics, started 
in 1979. The fact that bibliometric methods are already applied to the 
bibliometric field itself also indicates the rapid growth of the discipline. 

3 The role of citations 

A central question is: why bibliometric analysis of research performance? Peer 
review, that is, the evaluation made by expert peers, undoubtedly is an important 
procedure of quality judgment. In particular, the results of peer review judgment 
and those of bibliometric assessment are not completely independent variables. 
Indeed, peers take some bibliometric aspects into account in their judgment, for 
instance number of publications in the better journals. 

But peer review and related expert-based judgments may have serious 
shortcomings. Subjectivity, i.e., dependence of the outcomes on the choice of individual 
committee members, is one of the major problems. Moreover, peer review is 
slow and expensive (at least in terms of hours of volunteer work devoted to 
refereeing). In particular, peer review is practically unfeasible when 
the number of units to evaluate is large, e.g., all papers published by all 
members of a large department. 

Bibliometric assessment of research performance is based on the following 
central assumptions [7]: 

— scholars who have something important to say do publish their findings; 

— scholars refer in their own work to earlier work of other scholars to 
acknowledge intellectual debt and to witness the use of information. 

In research evaluation, citations have become a widely used measure of the impact of 
scientific publications. Smith [8] stated that: 

Citations are signposts left behind after information has been utilized. 

while Cronin [9] defined citations as: 

frozen footprints in the landscape of scholarly achievement which bear 
witness of the passage of ideas. 

However, problems with citation analysis as a reliable instrument of 
measurement and evaluation have been acknowledged. Citations reflect both the 
needs and idiosyncrasies of the citer, including such factors as utility, quality, 
availability, advertising (self-citations), collaboration or comradeship (in-house 
citations), chauvinism, mentoring, personal sympathies and antipathies, 
competition, neglect, obliteration by incorporation, augmentation, flattery, convention, 
reference copying, reviewing, and secondary referencing [10]. As Seglen says ([11], 
page 636), "while the sheer number of factors may help to achieve some 
statistical balance, we all know of scientists who are cited either much less (ourselves) 
or much more than they deserve on the basis of their scientific achievements". 

Nevertheless, citation analysis has demonstrated its reliability and usefulness 
as a tool for ranking and evaluating scholars and their publications [12]. 
Furthermore, the robustness of citations as a method to evaluate impact is particularly 
witnessed by the adoption of a similar approach in several other fields far 
different from bibliometrics, including web pages connected by hyperlinks [13,14], 
patents and corresponding citations [15], published opinions of judges and their 
citations within and across opinion circuits [16], and even sections of the Bible 
and the biblical citations they receive in religious texts [17]. 

4 Bibliometric indicators 

Assuming the central bibliometric assumptions mentioned in Section 3, we may 
design quantitative indicators to assess the research quality of an actor. But what 
aspects characterize quality of research? Moreover, what are the actors under 
evaluation? 

There is a general agreement that research quality is not characterized by a 
single element of performance. Van Raan [18] claims: 



It is not wise to force the assessment of researchers or of research groups 
into just one measure, because it reinforces the opinion that scientific 
performance can be expressed simply by one note. Several indicators are 
necessary in order to illuminate different aspects of performance. 

Moreover, Glänzel [19] adds: 

the use of a single index crashes the multidimensional space of biblio- 
metrics into one single dimension. 

Two potential dangers of condensing down quality of research to a single 
metric are: 

— a person may be damaged by the use of a simple index in a decision-making 
process if the index fails to capture important and different aspects of re- 
search performance; 

— scientists may respond to this by maximizing that particular metric to the 
detriment of doing more justifiable work. 

Quality of research is therefore described by different aspects; the most im- 
portant are: 

— productivity; this is the number of scholarly works that are produced by the 
actor; 

— impact (or popularity); this is the number of endorsements that the actor 
receives from other actors; 

— prestige; this is the prestige of the works produced by the actor and that of 
the endorsing actors. 

The actors under bibliometric evaluation may differ; however, the basic 
unit of evaluation is a single scholarly work, typically a journal paper. This 
basic unit, a scholarly work, can be aggregated at different levels obtaining more 
complex units of evaluation. For instance, single scholars are evaluated in terms 
of the set of works they produced. Scholars are typically grouped into research 
groups, institutions, regions within nations, countries or even international re- 
gions. Moreover, scholarly works are typically aggregated into journals or con- 
ferences and research fields. The level of aggregation is: 

— micro-level, when individuals, research groups or single scholarly works are 
considered; 

— meso-level, in the case of institutions or journals; 

— macro-level, in the case of regions, countries or research fields. 

Before delving into the realm of bibliometric indicators, it is important to 
understand that different scholarly disciplines can have very different 
publication and citation practices, including the absolute number of researchers, the 
average number of authors on each paper, the average number of citations in 
each paper, and the nature of results [20]. All these factors complicate the use 
of evaluation metrics across different disciplines. Nevertheless, interdisciplinary 
indicators have been proposed along the following line. In principle, it is 
possible to compute the mean (or median) number of citations per paper for an 
entire research discipline [21]. Hence, the actual number of citations for a 
publication can be compared with the expected number of citations for the field of 
the publication [7]. 
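
The comparison between actual and expected citations can be sketched as a simple ratio. The function name, the choice of the mean as the field baseline, and the sample numbers below are assumptions made for illustration; they are not the specific indicator defined in [7] or [21].

```python
from statistics import mean


def field_normalized_score(citations, field_citation_counts):
    """Ratio of a paper's citations to the mean citations per paper in its field.

    A value above 1 means the paper is cited more than the field average.
    """
    expected = mean(field_citation_counts)
    if expected == 0:
        raise ValueError("field baseline is zero; the ratio is undefined")
    return citations / expected


# Hypothetical field whose papers received 2, 4, and 6 citations.
print(field_normalized_score(12, [2, 4, 6]))
```

Replacing the mean with the median, as the text allows, only requires swapping the baseline function.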

4.1 Bibliometric measures at the individual level 

The traditional bibliometric indexes used to evaluate the performance of indi- 
vidual scholars include: 

— the number of publications produced by the scholar, possibly divided by the 
scholar's academic age; 

— the number of citations that the publications produced by the scholar have 
received from other scholarly works, possibly divided by the number of pub- 
lications. 

A more interesting measure is the h index. The h index of a scholar is the 
largest number h such that the scholar has h papers that have each received at 
least h citations. For instance, my current h index computed with Google 
Scholar is 14, meaning that I am the author of 14 papers each of them cited at 
least 14 times. The rest of my papers are all cited a number of times that is less 
than or equal to 14. The index was proposed by Hirsch, a physicist, in 2005 [22] and 
it immediately attracted interest from the public [23,24] and in the bibliometrics 
literature. In particular, it is currently computed by both Web of Science and 
Scopus. 

The index is meant to capture both productivity and impact of a scholar in 
such a way that it is hard to increase it, as well as to rig it, over a certain thresh- 
old. It favors researchers who produce a continuous stream of influential papers 
over those who publish many quickly forgotten ones or a few blockbusters. More- 
over, it is difficult to inflate the index, for instance with self-citations. Indeed, 
all self-citations to papers with less than h citations are irrelevant for the com- 
putation of the index, as are the self-citations to papers with many more than 
h citations. 

Hirsch argues that the h index is preferable to other single-number criteria 
commonly used to evaluate scientific output of a researcher [22] and that it has 
more predictive power [25]. Hirsch suggests that, for a given researcher, h should 
increase approximately linearly with time, that is h = m · n, where n is the 
academic age in years and m is the slope of the linear function. The parameter 
m should provide a useful yardstick to compare scientists of different seniority. 
In particular: 

— a value of m around 1 characterizes a successful scientist; 

— a value of m around 2 characterizes outstanding scientists; 

— a value of m around 3 or higher characterizes truly unique individuals. 



An additional advantage of the h index is that it is extremely simple and 
comprehensible. Moreover, it can be easily computed by sorting the published 
papers in decreasing order with respect to the number of received citations and 
scrolling down the list until the rank of the paper is greater than the number of 
citations that it has. The preceding rank equals the h index. 
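
The procedure just described translates directly into code. The following is a minimal sketch in which the function name and the sample citation counts are hypothetical.

```python
def h_index(citations):
    """h index of a list of per-paper citation counts."""
    # Sort papers by decreasing number of received citations.
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        # Stop at the first paper whose rank exceeds its citation count;
        # the preceding rank is the h index.
        if rank > cites:
            break
        h = rank
    return h


# Hypothetical scholar with five papers.
print(h_index([10, 8, 5, 4, 3]))  # 4
```

Sorting dominates the cost, so the computation takes O(p log p) time for p papers.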

The h index has also been criticized; in particular, Glänzel [19] and 
Bornmann and Daniel [26] describe opportunities and limitations of the h index. The 
following are acknowledged limitations of the index: 

1. it puts newcomers at a disadvantage since both publication output and ci- 
tation rates will be relatively low; 

2. it does not account for the number of authors in a paper; 

3. it is discipline dependent; 

4. it disadvantages small but highly-cited paper sets too strongly; 

5. it allows scientists to rest on their laurels ("your papers do the job for you") 
since the index never decreases and it might increase even if no new papers 
are published. 

Many variations of the index have been proposed to correct the mentioned 
flaws: 

— Hirsch proposes to solve problems number 1 and 5 by dividing the h index 
by the scientific age of the author [22]; 

— to address problem number 2 and, partially, issue number 3, Batista et al. 
[27] suggest adjusting the original h index by dividing it by the mean number 
of researchers in the h publications of the Hirsch core; 

— Egghe [28] proposes the g-index to account for problem number 4. Given 
a set of articles ranked in decreasing order of the number of citations that 
they received, the g-index is the (unique) largest number such that the top 
g articles received (together) at least g^2 citations. Moreover, for the same 
problem, Jin [29] suggests using the average number of citations received 
by articles in the Hirsch core (the set of articles that determine the h index 
value); 

— in order to address issue number 5, Katsaros et al. [30] propose the 
contemporary h index. The contemporary h index adds an age-related weighting to 
each cited article, giving less weight to older articles. 

In fact, all these variations did not attract much attention since they address 
a single issue without considering the others; hence the original version of the h 
index is still the most adopted. 
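
Three of the variants listed above are simple enough to sketch in code: the age-normalized h index, the g-index, and the average number of citations in the Hirsch core. All function names and sample data are hypothetical, and the g-index here is capped at the number of papers, which is one common convention and an assumption of this sketch.

```python
def hirsch_core(citations):
    """Citation counts of the papers that determine the h index."""
    ranked = sorted(citations, reverse=True)
    # In a decreasing list, the papers with citations >= rank form a prefix.
    return [c for rank, c in enumerate(ranked, start=1) if c >= rank]


def m_quotient(citations, academic_age):
    """h index divided by the academic age in years."""
    return len(hirsch_core(citations)) / academic_age


def g_index(citations):
    """Largest g such that the top g papers have together at least g^2 citations."""
    ranked = sorted(citations, reverse=True)
    total, g = 0, 0
    for rank, cites in enumerate(ranked, start=1):
        total += cites
        if total >= rank * rank:
            g = rank
    return g


def core_average(citations):
    """Average number of citations received by the papers in the Hirsch core."""
    core = hirsch_core(citations)
    return sum(core) / len(core) if core else 0.0


papers = [10, 8, 5, 4, 3]  # hypothetical citation counts
print(g_index(papers), core_average(papers), m_quotient(papers, 8))
```

For this sample the h index is 4 while the g-index is 5, which shows how the g-index gives extra credit for highly cited papers.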

4.2 Bibliometric measures at the journal level 

The traditional measure of journal impact is the impact factor. Roughly, the 
impact factor of a journal is the average number of recent citations received by 
articles published in the journal. More precisely, the impact factor of a journal for 
a specific census year is the mean number of citations that occurred in the census 
year to the articles published in the journal during a target window consisting 
of the two previous years. Such a measure was devised by Garfield, the founder 
of the Institute for Scientific Information (ISI). Today, Thomson-Reuters, which 
acquired the ISI in 1992, computes the impact factor for the journals it tracks 
and publishes it annually in the Journal Citation Reports (JCR) in separate 
editions for the sciences and the social sciences. 
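
The definition above amounts to a single division. The sketch below assumes two hypothetical dictionaries keyed by publication year; it does not reproduce the rules JCR applies to decide which items count as citable articles.

```python
def impact_factor(citations_in_census_year, articles_published, census_year):
    """Two-year impact factor of a journal for a given census year.

    citations_in_census_year: maps a publication year to the number of
        citations received in the census year by articles of that year.
    articles_published: maps a publication year to the number of articles.
    """
    # The target window consists of the two years preceding the census year.
    target_years = (census_year - 1, census_year - 2)
    cites = sum(citations_in_census_year.get(y, 0) for y in target_years)
    articles = sum(articles_published.get(y, 0) for y in target_years)
    if articles == 0:
        raise ValueError("no articles published in the target window")
    return cites / articles


# Hypothetical journal: 300 citations in 2008 to 100 articles from 2006-2007.
print(impact_factor({2007: 120, 2006: 180}, {2007: 50, 2006: 50}, 2008))  # 3.0
```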

The impact factor has become a standard to evaluate the impact of journals. 
Nevertheless, the impact factor has many faults [31,20,32]; the most commonly 
mentioned are: 

— the target window (2 years) is too narrow for more theoretical disciplines, 
e.g., mathematics, in which results need to be well digested before they are 
cited; 

— the impact factor does not normalize for the differences in citation practices 
across different disciplines; 

— it does not represent a typical value of the number of citations to articles in 
the journal when the citation distribution is skewed (asymmetric), which is 
the usual case in bibliometrics (see Section 6); 

— it counts citations without weighting them with the prestige of the citing 
journals. 

It follows that impact factors vary widely across disciplines and over time 
[33]. Moreover, due to the skewness of citation distributions and the fact that the 
impact factor is essentially a mean value, it is a (common) misuse of the impact 
factor to predict the importance of an individual publication, and hence of an 
individual researcher, based on the impact factor of the publication's journal. 
Indeed, most papers published in a high impact factor journal will ultimately 
be cited many fewer times than the impact factor may seem to suggest. Finally, 
some journals that have high impact factors are popular publication sources but 
are not appreciated by domain experts, that is, they are not prestigious sources 
[34]. 

Nunes Amaral et al. propose a steady state version of the impact factor 
[35]. They show that there exists a steady state period of time specific to each 
journal such that the number of citations to papers published in the journal in 
that period will not significantly change in the future: poorly cited papers have 
stopped accruing citations, while the trickle of citations to highly cited ones is 
small when compared to the already accrued citations. Hence, there are 
journal-specific census and target windows that well characterize the final impact of 
journals; such windows differ markedly from the ones exploited in the impact 
factor computation. 

Furthermore, the authors demonstrate that the logarithm of the number of 
citations to papers published in a journal in its steady state period is 
approximately normally distributed and hence it has a well-defined typical value (the 
mean). The authors propose to use such a mean as an alternative impact 
metric to the commonly used 2-year impact factor. They show that the suggested 
ranking scheme strongly diverges from the 2-year impact factor one, but it is 
very similar to the probability ranking scheme. The latter is the ranking that 
maximizes the probability that given a pair of papers (a, b) from journals A and 
B, respectively, paper a is more cited than paper b if A is above B in the 
ranking. The probability ranking is regarded as the optimal ranking in the context 
of information retrieval [36]. 

Both the impact factor and its steady state version weight all citations equally: 
citations from highly reputed journals, like Nature, Science, and Proceedings of 
the National Academy of Sciences of USA, are treated the same as citations from more 
obscure ones. In other words, they are measures of popularity, but do not account 
for prestige. By contrast, the Eigenfactor™ metric [37,38,39] weights journal 
citations by the influence of the citing journals. As a result, a journal is influential 
if it is cited by other influential journals. The definition is clearly recursive in 
terms of influence and the computation of the Eigenfactor scores involves the 
search of a stationary distribution, which corresponds to the leading eigenvector 
of a perturbed citation matrix. 

The Eigenfactor method was initially developed by Jevin West, Ben Althouse, 
Martin Rosvall, and Carl Bergstrom at the University of Washington and Ted 
Bergstrom at the University of California Santa Barbara. Eigenfactor scores are 
freely accessible at the Eigenfactor web site [39] and, from 2007, they have been 
incorporated into Thomson-Reuters Journal Citation Reports (JCR) for both 
science and social science journals. 

The idea underlying the Eigenfactor method originates from the work of [40] 
in the field of bibliometrics and from the contribution of [41] in the context 
of sociometry, which, in turn, generalizes Leontief's input-output model for the 
economic system [42]. Notably, Brin and Page use a similar intuition to design 
the popular PageRank algorithm that is part of their Google search engine: the 
importance of a web page is determined by the number of hyperlinks it receives 
from other pages as well as by the importance of the linking pages [43,14]. 

In the following, we illustrate the Eigenfactor method to measure journal 
influence as described at the Eigenfactor web site [39]. The Eigenfactor 
computation uses a census citation window of one year and an earlier target publication 
window of five years. Let us fix a census year and let $C = (c_{i,j})$ be a 
journal-journal citation matrix such that $c_{i,j}$ is the number of citations from articles 
published in journal i in the census year to articles published in journal j 
during the target window consisting of the five previous years. Hence, the ith row 
represents the citations given by journal i to other journals, and the jth column 
contains the citations received by journal j from other journals. Journal 
self-citations are ignored, hence $c_{i,i} = 0$ for all i. Moreover, let $a$ be an article vector 
such that $a_i$ is the number of articles published by journal i over the five-year 
target window divided by the total number of articles published by all journals 
over the same period. Notice that $a$ is normalized to sum to 1. 

A dangling node is a journal i that does not cite any other journals; hence, 
if i is dangling, the ith row of the citation matrix has all 0 entries. The citation 
matrix $C$ is transformed into a normalized matrix $H = (h_{i,j})$ such that all rows 
that are not dangling nodes are normalized by the row sum, that is, 

\[ h_{i,j} = \frac{c_{i,j}}{\sum_{k} c_{i,k}} \]

for all non-dangling i and all j. Furthermore, $H$ is mapped to a matrix $\hat{H}$ in 
which all rows corresponding to dangling nodes are replaced with the article 
vector $a$. Notice that $\hat{H}$ is row-stochastic, that is all rows are non-negative and 
sum to 1. 

A new row-stochastic matrix $P$ is defined as follows: 

\[ P = \alpha \hat{H} + (1 - \alpha) A \]

where $A$ is the matrix with identical rows each equal to the article vector $a$, 
and $\alpha$ is a free parameter of the algorithm, usually set to 0.85. Let $\pi$ be the left 
eigenvector of $P$ associated with the unity eigenvalue, that is, the vector $\pi$ such 
that $\pi = \pi P$. It is possible to prove that this vector exists and is unique. The 
vector $\pi$, called the influence vector, contains the scores used to weight citations 
allocated in matrix $H$. Finally, the Eigenfactor vector $r$ is computed as 

\[ r = 100 \cdot \frac{\pi H}{\sum_i [\pi H]_i} \]



That is, the Eigenfactor score of a journal is the sum of normalized citations 
received from other journals weighted by the Eigenfactor scores of the citing 
journals. The Eigenfactor scores are normalized such that they sum to 100. 
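
The steps above can be sketched with a plain power iteration. This is an illustrative implementation under stated assumptions, not the code used by the Eigenfactor project: the function name, the convergence tolerance, the iteration cap, and the toy citation matrix are all hypothetical, and the input is assumed to contain at least one citation between distinct journals.

```python
def eigenfactor(C, articles, alpha=0.85, tol=1e-12, max_iter=10_000):
    """Eigenfactor scores from a citation matrix C (C[i][j]: citations from i to j)
    and per-journal article counts over the target window."""
    n = len(C)
    total_articles = sum(articles)
    a = [x / total_articles for x in articles]  # article vector, sums to 1

    # Ignore journal self-citations.
    C = [[0 if i == j else C[i][j] for j in range(n)] for i in range(n)]

    # H: rows normalized by the row sum; dangling rows are left at zero.
    H, dangling = [], []
    for row in C:
        s = sum(row)
        dangling.append(s == 0)
        H.append([c / s for c in row] if s else [0.0] * n)

    # H_hat: dangling rows replaced with the article vector.
    H_hat = [a[:] if dangling[i] else H[i] for i in range(n)]

    # Power iteration for pi = pi P with P = alpha * H_hat + (1 - alpha) * A.
    # Since pi sums to 1, pi A equals the article vector a.
    pi = a[:]
    for _ in range(max_iter):
        new = [
            alpha * sum(pi[i] * H_hat[i][j] for i in range(n)) + (1 - alpha) * a[j]
            for j in range(n)
        ]
        converged = max(abs(x - y) for x, y in zip(new, pi)) < tol
        pi = new
        if converged:
            break

    # r = 100 * pi H / sum(pi H)
    pi_H = [sum(pi[i] * H[i][j] for i in range(n)) for j in range(n)]
    total = sum(pi_H)
    return [100 * x / total for x in pi_H]


# Toy example with three journals; the third one is a dangling node.
scores = eigenfactor([[0, 2, 1], [1, 0, 0], [0, 0, 0]], [10, 20, 30])
print(scores)
```

Power iteration is chosen here only for brevity; any method that finds the left eigenvector of P for eigenvalue 1 yields the same scores.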

The Eigenfactor metric has a solid mathematical background and an 
intuitive stochastic interpretation. The modified citation matrix $P$ is row-stochastic 
and can be interpreted as the transition matrix of a Markov chain on a finite 
set of states (journals). Hence, the influence vector $\pi$ corresponds to the 
stationary distribution of the associated Markov chain. Since $P$ is a primitive matrix, 
the Markov theorem applies, hence $\pi$ is the unique stationary distribution and, 
moreover, the influence weight $\pi_j$ of the jth journal is the limit probability of 
being in state j when the number of transition steps of the chain tends to 
infinity. Moreover, the Perron theorem for primitive matrices ensures that $\pi$ is 
a strictly positive vector corresponding to the leading eigenvector of $P$, that 
is, the eigenvector associated with the largest eigenvalue, which is 1 because 
$P$ is stochastic. The described stochastic Markov process has an intuitive 
interpretation in terms of random walks on the citation network [38]. Imagine a 
researcher that moves from journal to journal by following chains of citations. 
The researcher selects a journal article at random and reads it. Then, he retrieves 
at random one of the citations in the article and proceeds to the cited journal. 
Then, the researcher chooses at random an article from the reached journal and 
goes on like this. Eventually, the researcher gets bored of following citations, 
and selects a random journal in proportion to the number of articles published 
by each journal. With this model of research, by virtue of the Ergodic theorem 
for Markov chains, the influence weight of a journal corresponds to the relative 
frequency with which the random researcher visits the journal. 

The Eigenfactor score is a size-dependent measure of the total influence of a 
journal, rather than a measure of influence per article, like the impact factor. To 
make the Eigenfactor scores size-independent and comparable to impact factors, 
we need to divide the journal influence by the number of articles published in 
the journal. In fact, this measure, called Article Influence™, is available both 
at the Eigenfactor web site and at Thomson-Reuters's JCR. 

5 Bibliometric data sources 

Bibliometric analysis can be conducted on the basis of any sufficiently large 
bibliographic database enhanced with citation counts. A bibliometric data source 
may be evaluated according to the following criteria: 

— the coverage of the database; 

— the supported features for searching, sorting, and exporting bibliographic 
data as well as for computing performance indicators on them; 

— the availability of the database (free or subscription-based). 

In particular, coverage is of crucial importance since it influences the
outcomes of the computation of bibliometric indicators. Uneven coverage may
produce performance measures that are far from the real figures, and this may
lead to wrong decisions. Some aspects directly connected to database coverage
are:

— what types (journals, conference papers, books, and so on) of works are 
covered and how evenly; 

— what research fields are covered and how evenly; 

— what languages other than English and what countries other than North 
American and Western European ones are covered and how evenly. 

The bibliometric databases of the Institute for Scientific Information (ISI)
have been the most generally accepted data sources for bibliometric analysis.
The ISI was founded by Eugene Garfield in 1960 and was acquired in 1992 by
Thomson, one of the world's largest information companies. In 2007, the Thomson
Corporation reached an agreement with Reuters to combine the two companies
under the name Thomson-Reuters (TR).

TR maintains Web of Knowledge, an online academic database which pro- 
vides access to many resources, in particular: 

— Web of Science (WoS), which includes the Science Citation Index (SCI), the 
Social Science Citation Index (SSCI), and the Arts and Humanities Citation 
Index (AHCI); 



— Journal Citation Reports (JCR), containing citation information, and in par- 
ticular the impact factor, for the journals tracked by TR. JCR are published 
annually in separate editions for the sciences and the social sciences. 

The use of TR citation databases, in particular as the only bibliographic
source for bibliometric analysis, has attracted quite a number of criticisms.
The most frequently mentioned flaws are:

1. it provides different coverage between research fields; 

2. it is limited to citations from journals but does not count citations from 
other sources, mainly from books and most conference proceedings; 

3. it covers mainly North American, Western European, and English-language 
titles; 

4. it is only available to those academics whose institutions are able and
willing to bear the subscription cost.

Flaw number one is particularly serious since it is the main cause of the
variation of the impact factor measure across fields [33]. The internal coverage
of a bibliometric data source with respect to a field is defined as the fraction of
citations coming from papers internal to the database and belonging to the field
that match a paper in the same data source. The internal coverage varies widely
across disciplines, e.g., 0.803 for molecular and cell biology, 0.552 for mathematics,
and 0.226 for computer science. This means that, for instance, more than 3/4
of the citations from computer science papers indexed in WoS are addressed
to papers that are not contained in WoS. Furthermore, flaw number two
is critical for disciplines like computer science that heavily rely on conference
publications and for the humanities, whose scholars frequently publish books.

Two major alternatives to Web of Science are Elsevier's Scopus and Google
Scholar. Scopus, like Web of Science, is a subscription-based proprietary database.
On the contrary, Google Scholar is freely accessible. There are many studies that
compare citation data retrieved from different data sources. Table 4 displays some
of these. The first column shows the publication reference of the study, the second
column contains the set of data sources that are compared, while the research
field of the publications considered in the study is given in the third column.
The papers are sorted in chronological order.

The rest of this section illustrates a large-scale comparison between Web of
Science, Scopus and Google Scholar conducted by Meho and Yang in 2007 [56].
The study covers more than 10000 citations to approximately 1100 scholarly
works of all 15 faculty members of the School of Library and Information Science
(LIS) at Indiana University-Bloomington. The authors found that both Web of
Science and Scopus provide substantial factual information about their databases,
including the number of records and the list of titles indexed. On the contrary,
Google Scholar does not publish information about its coverage and frequency
of updates.

Moreover, both Web of Science and Scopus offer features for searching, sorting,
and exporting the bibliographic data. On the contrary, Google Scholar does
not provide the retrieved data in a useful bibliographic format and does not
allow sorting it in any way. Collecting (extracting, verifying, cleaning, organizing,
classifying, and saving into a bibliographic format) data from Google Scholar
took the authors 30 times as much time as collecting Web of Science data and
15 times as much time as collecting Scopus data.

Table 4. Literature comparing citation data over different data sources.

The authors analyzed the citations to works of LIS members published be- 
tween 1996 and 2005 divided by document type. For Web of Science and Scopus, 
most of these citations come from journals (88.7% and 84.4%, respectively), and 
a few from conference papers (11.3% and 15.6%, respectively). On the contrary, 
Google Scholar indexes more types of works, including journal papers (42.5%), 
conference papers (33.7%), theses (9.8%), books (5.5%), reports (4.8%) and other 
document types (3.7%).

Furthermore, the authors studied the distribution of unique and overlapping
citations to works of LIS members published between 1996 and 2005 that
were found in journal and conference articles only. Web of Science contains 2023
citations, Scopus contains 2301 citations, and Google Scholar contains 4181
citations. The overlap of citations between the three databases is relatively low,
with significant differences from one research area to another: the overlap
between Web of Science and Scopus is 1591 citations (58.2%) out of the 2733
citations found in at least one of the two databases; Web of Science misses 710
of the citations found in Scopus (26% of this union), while Scopus misses 432
of the citations found in Web of Science (15.8% of this union).

The overlap between Google Scholar and the union of Web of Science
and Scopus is 1629 citations (30.8%) out of the 5285 citations found in at least
one of the three databases: Google Scholar misses 1104 citations (20.9%) of those
found in the union of Web of Science and Scopus, and the union of Web of Science and
Scopus misses 2552 citations (48.3%) of those found in Google Scholar.
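
The overlap figures follow from the reported counts by inclusion-exclusion. The
short check below is only an arithmetic sketch: the counts are those quoted in
the text, while the variable names and the helper pct are illustrative
assumptions.

```python
# Citation counts reported in the text (journal and conference articles only).
wos, scopus, gs = 2023, 2301, 4181
wos_and_scopus = 1591   # citations found in both Web of Science and Scopus
gs_and_union = 1629     # citations found in Google Scholar and in WoS or Scopus

# Inclusion-exclusion: |A or B| = |A| + |B| - |A and B|.
union_ws = wos + scopus - wos_and_scopus
missed_by_wos = scopus - wos_and_scopus
missed_by_scopus = wos - wos_and_scopus

union_all = union_ws + gs - gs_and_union
missed_by_gs = union_ws - gs_and_union
only_gs = gs - gs_and_union


def pct(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


print(union_ws, pct(wos_and_scopus, union_ws))   # size of the WoS/Scopus union and overlap share
print(union_all, pct(gs_and_union, union_all))   # size of the three-way union and overlap share
```

All percentages quoted in the two paragraphs above are shares of the relevant
union, not of a single database.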

The former figure is quite striking, since virtually all citations of Web of
Science and Scopus come from refereed, reputable sources. It is rumored that some
publishers (notably, Elsevier and the American Chemical Society) did not allow
Google Scholar crawlers to enter their databases. As for the second figure, the
authors noticed that:

— most of the citations uniquely found by Google Scholar are from refereed 
sources; 

— most of these citations come from low impact sources; 

— most of these citations were identified through documents made available
online by their authors rather than from the source's official site. 

The authors studied the distribution of citations by language and found
that Google Scholar provides better coverage of non-English language materials
(6.9%) than both Web of Science (1.1%) and Scopus (0.7%).

Meho and Yang concluded that Web of Science, Scopus, and Google Scholar 
complement rather than replace each other, so they should be used together 
rather than separately in citation analysis. In particular, although Web of
Science remains an indispensable citation database, it should not be used alone for
locating citations, because both Scopus and Google Scholar identify a considerable
number of citations not found in Web of Science. Although the citations unique
to Google Scholar are not of the same quality as those found in the two proprietary
databases, they could be useful in showing evidence of broader international
impact.

The authors also concluded that there is an important impact advantage in
favor of the articles, and the corresponding journals, that their authors make
available online (on personal web pages or on electronic preprint archives like
arXiv), since they are more likely to be discovered by human and automatic agents
(like the crawlers of Google Scholar), possibly increasing the citation impact.



6 Bibliometric distributions 

The probability distributions that are usual suspects in bibliometrics are Pareto, 
(stretched) exponential, and lognormal distributions. 

Pareto distribution. Also known as power law distribution, it has been used to 
model phenomena where most of the effects come from few of the causes. The 
distribution is named after the Italian economist Vilfredo Pareto, who originally
observed it while studying the allocation of wealth among individuals: a larger
share of the wealth of any society (approximately 80%) is owned by a smaller fraction
(about 20%) of the people in the society [53]. Examples of phenomena that
are approximately Pareto distributed are: size of human settlements, size of
meteorites, standardized price returns on individual stocks, file size of Internet
traffic using the TCP protocol, duration of transactions in database management
systems, word frequency in relatively lengthy texts (Zipf law [6]), and scientific
productivity of scholars (Lotka law [4]). Furthermore, in 1998 Redner analyzed
the citation distribution for all papers in journals which were catalogued by 
the ISI at that time; the author found that the asymptotic tail of the citation 
distribution appears to be described by a Pareto law [53] . 

The probability density function for a Pareto distribution is defined for $x \geq 1$
in terms of parameter $\alpha > 0$ as follows:

$$f(x) = \frac{\alpha}{x^{\alpha+1}}$$

The cumulative distribution function is:

$$F(x) = 1 - \frac{1}{x^{\alpha}}$$

The mean is $\alpha/(\alpha - 1)$ for $\alpha > 1$, and infinite otherwise. The median is
$\sqrt[\alpha]{2}$ and the mode is 1. Notice that the mean is greater than the median, which
is greater than the mode, and the limit for $\alpha \to \infty$ of both the mean and the
median is the mode 1. Skewness is

$$\gamma = \frac{2(1 + \alpha)}{\alpha - 3} \sqrt{\frac{\alpha - 2}{\alpha}}$$

for $\alpha > 3$, and kurtosis is

$$\kappa = \frac{6(\alpha^3 + \alpha^2 - 6\alpha - 2)}{\alpha(\alpha - 3)(\alpha - 4)}$$

for $\alpha > 4$. Both skewness and kurtosis are greater than zero and tend to 2 and 6,
respectively, as $\alpha \to \infty$. The raw moments are found to be $E(X^n) = \alpha/(\alpha - n)$
for $\alpha > n$.
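
As a quick sanity check of the Pareto formulas, the following sketch evaluates
them numerically. The function names and the sample parameter values are
illustrative assumptions; only the formulas themselves come from the text.

```python
import math


def pareto_cdf(x, alpha):
    """F(x) = 1 - 1 / x**alpha, for x >= 1."""
    return 1 - x ** (-alpha)


def pareto_mean(alpha):
    """alpha / (alpha - 1) for alpha > 1, infinite otherwise."""
    return alpha / (alpha - 1) if alpha > 1 else math.inf


def pareto_median(alpha):
    """The alpha-th root of 2."""
    return 2 ** (1 / alpha)


def pareto_skewness(alpha):
    """Valid for alpha > 3 only."""
    return 2 * (1 + alpha) / (alpha - 3) * math.sqrt((alpha - 2) / alpha)


def pareto_kurtosis(alpha):
    """Valid for alpha > 4 only."""
    num = 6 * (alpha ** 3 + alpha ** 2 - 6 * alpha - 2)
    return num / (alpha * (alpha - 3) * (alpha - 4))


alpha = 2.5  # hypothetical shape parameter
print(pareto_mean(alpha), pareto_median(alpha))      # mean > median > mode = 1
print(pareto_cdf(pareto_median(alpha), alpha))       # half of the mass lies below the median
print(pareto_skewness(1e6), pareto_kurtosis(1e6))    # approach 2 and 6 for large alpha
```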

Stretched exponential distribution. This is a family of extensions of the
well-known exponential distribution characterized by fatter tails. Laherrère and
Sornette showed that different phenomena in nature and economy can be described
in the regime of the stretched exponential distribution, including radio and light
emission from galaxies, oilfield reserve size, agglomeration size, stock market price
variation, biological extinction events, earthquake size, temperature variation of the
earth, and, notably, citation of the most cited physicists in the world [65].

The probability density function is a simple extension of the exponential
distribution with one additional stretching parameter $\alpha$:

$$f(x) = \alpha \lambda^{\alpha} x^{\alpha - 1} e^{-(\lambda x)^{\alpha}}$$

where $x \geq 0$, $\lambda > 0$ and $0 < \alpha \leq 1$. In particular, if the stretching parameter
$\alpha = 1$, then the distribution is the usual exponential distribution. When the
parameter $\alpha$ is not bounded above by 1, the resulting distribution is better known
as the Weibull distribution. The cumulative distribution function is:

$$F(x) = 1 - e^{-(\lambda x)^{\alpha}}$$

It can be shown that the nth raw moment $E(X^n)$ is
$\frac{1}{\lambda^n} \Gamma\left(\frac{\alpha + n}{\alpha}\right)$, where $\Gamma(x)$ is
the Gamma function, an extension of the factorial function to real and complex
numbers, defined by:

$$\Gamma(x) = \int_0^{\infty} t^{x-1} e^{-t} \, dt$$

In particular, it holds that $\Gamma(1) = 1$ and $\Gamma(x + 1) = x\Gamma(x)$. Hence, for a positive
integer n, we have $\Gamma(n + 1) = n!$. Notice that, if $\alpha = 1$, then the raw moments
are given by $n!/\lambda^n$ and they correspond to the raw moments of the exponential
distribution. In particular, the mean $E(X)$ is the first raw moment, which is equal
to $\frac{1}{\lambda} \Gamma\left(\frac{\alpha + 1}{\alpha}\right)$. The median is given by
$\frac{1}{\lambda} \sqrt[\alpha]{\log 2}$ and the mode is 0. Notice again
that the mean is greater than the median, which is greater than the mode.
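
The moment formula can be checked numerically with the Gamma function from the
Python standard library. This is a sketch only: the function names and the
parameter values (rate 2, stretching parameter 0.5) are illustrative
assumptions.

```python
import math


def se_cdf(x, lam, alpha):
    """F(x) = 1 - exp(-(lam * x) ** alpha), for x >= 0."""
    return 1 - math.exp(-((lam * x) ** alpha))


def se_raw_moment(n, lam, alpha):
    """E(X^n) = Gamma((alpha + n) / alpha) / lam ** n."""
    return math.gamma((alpha + n) / alpha) / lam ** n


def se_median(lam, alpha):
    """(log 2) ** (1 / alpha) / lam."""
    return math.log(2) ** (1 / alpha) / lam


lam, alpha = 2.0, 0.5  # hypothetical parameters
print(se_raw_moment(1, lam, alpha), se_median(lam, alpha))  # mean > median > mode = 0
print(se_raw_moment(3, lam, 1.0), math.factorial(3) / lam ** 3)  # exponential case: n!/lam^n
```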

Lognormal distribution. It is the distribution of any random variable whose loga- 
rithm is normally distributed. A lognormal distribution characterizes phenomena 
determined by the multiplicative product of many independent effects. The log- 
normal distribution is a usual suspect in bibliometrics. In a 1957 study based 
on the publication record of the scientific research staff at Brookhaven National 
Laboratory, Shockley observed that the scientific publication rate is
approximately lognormally distributed [66]. More recently, Stringer et al. studied the
citation distribution for journals indexed in Web of Science publishing at least
50 articles per year for at least 15 years and demonstrated that in a steady
citational state the logarithm of the number of citations has a journal-specific
typical value [35]. Finally, Radicchi et al. analyzed the distribution of the ratio
between the number of citations received by an article and the average number
of citations received by articles published in the same field and year for papers
in different research categories (the category closest to computer science is
cybernetics) [21]. They found a similar distribution for each category with a good
fit with the lognormal distribution.

The lognormal probability density function is defined in terms of parameters
$\mu$ and $\sigma > 0$ as follows:

$$f(x) = \frac{1}{x \sigma \sqrt{2\pi}} \, e^{-\frac{(\log(x) - \mu)^2}{2\sigma^2}}$$

for $x > 0$. The parameters $\mu$ and $\sigma$ are the mean and standard deviation of
the variable's natural logarithm. The cumulative distribution function has no
closed-form expression and is defined in terms of the density function as for
the normal distribution. The mean is $e^{\mu + \sigma^2/2}$, the median is $e^{\mu}$ and the mode is
$e^{\mu - \sigma^2}$. Notice that the mean is greater than the median, which is greater than the
mode. This suggests a positive asymmetry of the distribution. Indeed, skewness
is $(e^{\sigma^2} + 2)\sqrt{e^{\sigma^2} - 1} > 0$. Moreover, excess kurtosis is
$e^{4\sigma^2} + 2e^{3\sigma^2} + 3e^{2\sigma^2} - 6 > 0$.
The raw moments are given by $E(X^n) = e^{n\mu + n^2\sigma^2/2}$.
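
The lognormal summary statistics can be evaluated in the same spirit. The
sketch below is illustrative: the function names and parameter values are
assumptions, and the cumulative distribution function is expressed through the
error function, as is usual for the normal distribution.

```python
import math


def lognormal_cdf(x, mu, sigma):
    """CDF via the error function: 0.5 * (1 + erf((log x - mu) / (sigma * sqrt 2)))."""
    return 0.5 * (1 + math.erf((math.log(x) - mu) / (sigma * math.sqrt(2))))


def lognormal_raw_moment(n, mu, sigma):
    """E(X^n) = exp(n * mu + n^2 * sigma^2 / 2)."""
    return math.exp(n * mu + n ** 2 * sigma ** 2 / 2)


def lognormal_skewness(sigma):
    """(exp(sigma^2) + 2) * sqrt(exp(sigma^2) - 1)."""
    w = math.exp(sigma ** 2)
    return (w + 2) * math.sqrt(w - 1)


mu, sigma = 0.5, 1.0  # hypothetical parameters
mean = lognormal_raw_moment(1, mu, sigma)   # exp(mu + sigma^2 / 2)
median = math.exp(mu)
mode = math.exp(mu - sigma ** 2)
print(mean, median, mode)                   # mean > median > mode
print(lognormal_skewness(sigma))            # positive: right-skewed
```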

It is interesting to observe that all the above distributions that are com- 
monly used to model bibliometric phenomena are positively (right) skewed. A 
distribution is symmetric if the values are equally distributed around a typical 
figure (the mean); a well-known example is the normal (Gaussian) distribution. 
A distribution is right-skewed if it contains many low values and relatively few
high values. It is left-skewed if it comprises many high values and relatively
few low values. As a rule of thumb, when the mean is larger than the median
the distribution is right-skewed and when the median dominates the mean the
distribution is left-skewed. A more precise numerical indicator of distribution
skewness is the third standardized central moment, that is

$$\gamma = \frac{E[(X - \mu)^3]}{\sigma^3}$$

where $\mu$ and $\sigma$ are the mean and standard deviation of the distribution of
random variable $X$, respectively. A value close to 0 indicates symmetry; a value
greater than 0 corresponds to right skewness, and a value lower than 0 means left
skewness.
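
For a finite sample, the third standardized central moment can be computed
directly. In the sketch below the citation counts are invented for
illustration and the function name is an assumption.

```python
import statistics


def skewness(values):
    """Third standardized central moment of a sample (population version)."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return sum((v - mu) ** 3 for v in values) / (len(values) * sigma ** 3)


citations = [0, 0, 1, 1, 2, 3, 5, 8, 40]   # hypothetical citation counts
symmetric = [1, 2, 3, 4, 5]

print(statistics.fmean(citations), statistics.median(citations))  # mean exceeds median
print(skewness(citations))   # positive: right-skewed
print(skewness(symmetric))   # zero: symmetric
```

Both the rule of thumb (mean larger than median) and the numerical indicator
agree on the invented citation sample being right-skewed.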

The observed right skewness might be considered as an application of the 
more general Pareto Principle (also known as 80-20 rule) [7] . The principle states 
that: 

Most (approximately 80%) of the effects come from few (about 20%) of
the causes.



It has been suggested that the irreducible skewness of distributions of scholar
productivity and article citedness may be explained by sociological reinforcement
mechanisms such as the Principle of Cumulative Advantage. De Solla Price
formulated this in 1976 as follows [67]:



Success seems to breed success. A paper which has been cited many times 
is more likely to be cited again than one which has been little cited. An 
author of many papers is more likely to publish again than one who has 
been less prolific. A journal which has been frequently consulted for some 
purpose is more likely to be turned to again than one of previously infre- 
quent use. 

The Matthew Effect may additionally contribute to skewness. According to 
Merton [68] : 

The Matthew Effect consists in the accruing of greater increments of
recognition for particular scientific contributions to scientists of consid- 
erable repute and the withholding of such recognition from scientists who 
have not yet made their mark. 

It takes its name from the following line in Jesus' parable of the talents in
the biblical Gospel of Matthew: For unto every one that hath shall be given, and
he shall have abundance: but from him that hath not shall be taken away even
that which he hath.

Interestingly, Seglen claims that skewness is an intrinsic characteristic of
distributions related to extreme types of human effort; although scientific ability
may be normally distributed in the general population, scientists are likely to
form an extreme-property distribution with respect to their speciality, be it in
terms of citedness or in terms of productivity. This statistical pattern is expected
for different types of highly specialized human activity, a parallel being found in
the distribution of performance by top athletes [TT].

7 Bibliometric maps 

A first and crucial step in the building of a map is the definition of a research
field. There are two main approaches: concept-similarity mapping and citation
mapping. The concept-similarity approach defines a research field on the basis of
repeated concepts (keywords) in publications [69,70]. It can be further divided
into the two following methods: co-publication analysis, in which publications are
related if they mention the same concepts, and co-concept analysis, in which
concepts are related if they are mentioned together in the same publication.

The citation approach clusters a research field on the basis of citations in
publications. Two typical methods to identify similar publications are co-citation
coupling [71], in which publications are related when they are cited by the same
papers, and bibliographic coupling [72], in which papers are related when they
cite the same papers. In the following, we introduce these two techniques with
the help of a model suggested in [73] and refined in [74]. Suppose we have $n$
publications $p_1, \ldots, p_n$ that cite $m$ references $r_1, \ldots, r_m$. We build a Boolean
citation matrix $C = (c_{i,j})$ of size $n \times m$ such that $c_{i,j} = 1$ if $p_i$ cites $r_j$ and
$c_{i,j} = 0$ otherwise. Let $c_i = \sum_j c_{i,j}$ be the number of cited references of $p_i$ and
$c^j = \sum_i c_{i,j}$ be the number of citations received by $r_j$. A measure of bibliographic
coupling between publications $p_i$ and $p_j$ is:

$$r_{i,j} = \frac{\sum_k c_{i,k} \cdot c_{j,k}}{\sqrt{c_i \cdot c_j}}$$

This is the ratio of the number of references shared by publications $p_i$ and $p_j$ and
the geometric mean of the number of references of the two papers concerned.
Notice that $0 \leq r_{i,j} \leq 1$, and $r_{i,j} = 0$ when publications $p_i$ and $p_j$ share no
references, while $r_{i,j} = 1$ when publications $p_i$ and $p_j$ have the same bibliography.
Geometrically, $r_{i,j}$ is the cosine of the angle formed by the ith and jth rows of
the citation matrix, which is 0 when the two vectors are orthogonal, and is 1
when they are parallel. In matrix notation, let $A = (a_{i,j}) = CC^T$, that is, $a_{i,j}$ is
the number of references shared by the ith and jth publications, and, in particular,
$a_{i,i} = c_i$ is the number of references of $p_i$. Let $D$ be the diagonal matrix such
that the ith diagonal entry is $1/\sqrt{a_{i,i}}$. Then we have that $R = (r_{i,j})$ is defined
as

$$R = DAD$$
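
The matrix formulation R = DAD translates almost literally into code. The
sketch below uses plain Python lists; the small citation matrix is invented
for illustration, and the code assumes that every publication cites at least
one reference (otherwise the diagonal entry of D is undefined).

```python
import math


def bibliographic_coupling(C):
    """Return R = D A D for a Boolean citation matrix C (rows = publications)."""
    n = len(C)
    # A = C C^T: A[i][j] is the number of references shared by p_i and p_j.
    A = [[sum(x * y for x, y in zip(C[i], C[j])) for j in range(n)]
         for i in range(n)]
    # D is diagonal with entries 1 / sqrt(A[i][i]); D A D rescales entry (i, j)
    # by 1 / sqrt(A[i][i] * A[j][j]).
    d = [1 / math.sqrt(A[i][i]) for i in range(n)]
    return [[d[i] * A[i][j] * d[j] for j in range(n)] for i in range(n)]


# Hypothetical citation matrix: 4 publications, 4 references.
C = [
    [1, 1, 0, 0],   # p1 cites r1, r2
    [1, 1, 0, 0],   # p2 cites r1, r2 (same bibliography as p1)
    [0, 1, 1, 0],   # p3 cites r2, r3
    [0, 0, 0, 1],   # p4 cites r4 only
]
R = bibliographic_coupling(C)
```

Here p1 and p2 have coupling 1 (identical bibliographies), p1 and p3 have
coupling 0.5 (one shared reference out of two each), and p4 has coupling 0
with every other publication.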

On the other hand, a measure of co-citation coupling between cited publications
(references) $r_i$ and $r_j$ is:

$$s_{i,j} = \frac{\sum_k c_{k,i} \cdot c_{k,j}}{\sqrt{c^i \cdot c^j}}$$

This is the ratio of the number of articles that cite both publications $r_i$ and $r_j$
and the geometric mean of the number of citations received by the two
publications involved. This is also the cosine of the angle formed by the ith and jth
columns of the citation matrix. Again, $0 \leq s_{i,j} \leq 1$, and $s_{i,j} = 0$ when
publications $r_i$ and $r_j$ are never co-cited, while $s_{i,j} = 1$ when publications $r_i$ and
$r_j$ are always cited together. In matrix notation, let $B = (b_{i,j}) = C^T C$, that
is, $b_{i,j}$ is the number of articles that co-cite the ith and jth publications, and, in
particular, $b_{i,i} = c^i$ is the number of citations gathered by $r_i$. Let $D'$ be the
diagonal matrix such that the ith diagonal entry is $1/\sqrt{b_{i,i}}$. Then we have that
$S = (s_{i,j})$ is defined as

$$S = D'BD'$$
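
Co-citation strength can equivalently be computed from the sets of citing
publications, since the number of co-citing articles is the size of the
intersection of two such sets. The following sketch takes this set-based route
rather than forming the matrix product; the citation matrix is invented for
illustration, and the code assumes that every column (cited publication)
receives at least one citation.

```python
import math


def cocitation(C):
    """Cosine similarity between the columns of a Boolean citation matrix C."""
    m = len(C[0])
    # citing[j] is the set of row indices (citing publications) of column j.
    citing = [{i for i, row in enumerate(C) if row[j]} for j in range(m)]
    return [
        [len(citing[i] & citing[j]) / math.sqrt(len(citing[i]) * len(citing[j]))
         for j in range(m)]
        for i in range(m)
    ]


# Hypothetical citation matrix: 3 citing publications, 4 cited publications.
C = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
]
S = cocitation(C)
```

The first two columns are always cited together and get strength 1, the first
and third columns share one of two citing articles each and get 0.5, and the
first and fourth columns are never co-cited and get 0.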

It is worth noticing that the similarity formulas used in citation coupling
closely resemble the Pearson correlation coefficient formula for two statistical
samples $x$ and $y$, that is:

$$\rho_{x,y} = \frac{\sigma_{xy}}{\sigma_x \cdot \sigma_y} = \frac{\sum_k (x_k - \mu_x) \cdot (y_k - \mu_y)}{\sqrt{\sum_k (x_k - \mu_x)^2} \cdot \sqrt{\sum_k (y_k - \mu_y)^2}}$$

In particular, when the means of both statistical samples $x$ and $y$ are null, the
Pearson correlation coefficient is exactly the cosine of the angle formed by the
two sample vectors and the two measures coincide.
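
The relation between the two measures is easy to verify numerically: the
Pearson coefficient is the cosine of the mean-centered vectors. The vectors in
the sketch below are invented for illustration.

```python
import math


def cosine(x, y):
    """Cosine of the angle between vectors x and y."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))


def pearson(x, y):
    """Pearson correlation: the cosine of the mean-centered vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return cosine([a - mx for a in x], [b - my for b in y])


# Zero-mean samples: the two measures coincide.
x0, y0 = [-2, 0, 2], [-1, -1, 2]
print(cosine(x0, y0), pearson(x0, y0))

# Boolean citation vectors do not have zero mean: the measures differ.
x1, y1 = [1, 0, 1, 1], [1, 1, 0, 1]
print(cosine(x1, y1), pearson(x1, y1))
```

For Boolean citation vectors the means are generally not null, so the cosine
and the Pearson coefficient can differ considerably, even in sign.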



Once the similarity strength between bibliometric units has been established, 
bibliometric units are typically represented as graph nodes and the similarity
relationship between two units is represented as a weighted edge connecting 
the units, where weights stand for the similarity intensity. Such visualizations 
are called bibliometric maps. Such maps are powerful but they are often highly 
complex. It therefore is helpful to abstract the network into inter-connected 
modules of nodes. Good abstractions both simplify and highlight the underlying 
structure and the relationships that they depict. When the units are publications 
or concepts, the identified modules represent in most cases recognizable research 
fields. In the rest of this section, we describe three methods for creating these 
abstractions: clustering, principal component analysis, and information-theoretic 
abstractions. 

7.1 Clustering 

Informally, clustering is the process of organizing objects into groups whose
members are similar in some way [75,76]. A cluster is a collection of objects
which are similar to one another and dissimilar to objects belonging to other
clusters. Clustering can be formalized as follows. We are given a weighted
undirected graph $G$, where the weight function assigns a dissimilarity value to each
pair of nodes, and an objective function $f$ that assigns a value of merit to any
partition of the set of nodes of $G$. Clustering problems are optimization problems that
usually have one of the following forms [77]:

— Let $G$ be a graph, $f$ be an objective function, and $k$ be an integer. Find a
partition of the nodes in $G$ with cardinality $k$ and with the least value for the
objective function.

— Let $G$ be a graph, $f$ be an objective function, and $c$ be a real number. Find
the smallest partition of the nodes in $G$ with objective function value less than
or equal to the value $c$.

The first type of clustering problem is usually approached using repeated par- 
tition techniques. These techniques choose an initial partition with k clusters and 
then move objects between clusters trying to minimize the objective function. 
The procedure stops as soon as a local minimum for the objective function is 
reached. The most popular algorithm in this category is K-means [75].
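
A repeated partition technique can be sketched with the classical K-means
iteration. The sketch makes assumptions that go beyond the graph formulation
above: objects are points in Euclidean space, the objective function is the
sum of squared distances to the cluster centers, and the initial centers are
supplied by the caller. The data points are invented.

```python
def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def kmeans(points, initial_centers, max_iter=100):
    """Move points between k clusters until the assignment stops changing."""
    centers = list(initial_centers)
    assignment = None
    for _ in range(max_iter):
        # Assign every point to its nearest center.
        new_assignment = [
            min(range(len(centers)), key=lambda c: dist2(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no object moved: a local minimum has been reached
        assignment = new_assignment
        # Move every center to the mean of the points assigned to it.
        for c in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return assignment, centers


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
assignment, centers = kmeans(points, [(0, 0), (10, 10)])
```

As the text notes, the procedure only guarantees a local minimum, so the
result depends on the initial partition.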

Hierarchical clustering methods are typically applied to solve the second type 
of clustering problem [79]. In this case, the size of the partition is not fixed in 
advance. These algorithms are of two kinds: agglomerative and divisive. An 
agglomerative strategy starts with a singleton partition containing a cluster for 
each object and then merges similar clusters until the universal partition is 
obtained. A divisive strategy starts from the universal partition containing a 
unique set with all objects and then divides clusters that include dissimilar 
objects until the singleton partition is reached. Both strategies can use different
criteria to decide which clusters to join or to divide. They output a hierarchical
structure (a dendrogram) describing the whole merging/dividing process. This
structure can be used to choose the smallest partition among the generated ones
(a small subset of all partitions) with objective function value less than or equal
to the given threshold.
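
The agglomerative strategy can be sketched as follows. Two choices in this
sketch are assumptions not fixed by the text: clusters are merged by complete
linkage (largest pairwise dissimilarity), and the objective function is the
largest dissimilarity inside any cluster. The threshold is assumed to be
non-negative, and the five objects are invented positions on a line.

```python
def linkage(dist, a, b):
    """Complete linkage: largest dissimilarity between members of a and b."""
    return max(dist[i][j] for i in a for j in b)


def max_diameter(dist, partition):
    """Objective function: largest dissimilarity inside any cluster."""
    return max(linkage(dist, c, c) for c in partition)


def agglomerative(dist, threshold):
    """Return the merge history and the smallest partition within the threshold."""
    clusters = [frozenset([i]) for i in range(len(dist))]   # singleton partition
    history = [list(clusters)]
    while len(clusters) > 1:
        pairs = [(x, y) for i, x in enumerate(clusters) for y in clusters[i + 1:]]
        a, b = min(pairs, key=lambda pair: linkage(dist, *pair))
        clusters = [c for c in clusters if c != a and c != b] + [a | b]
        history.append(list(clusters))
    feasible = [p for p in history if max_diameter(dist, p) <= threshold]
    return history, min(feasible, key=len)


positions = [0, 1, 5, 6, 20]
dist = [[abs(x - y) for y in positions] for x in positions]
history, best = agglomerative(dist, threshold=2)
```

The list history plays the role of the dendrogram: it records one partition
per merge, from the singleton partition to the universal one.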

The computational complexity of clustering problems mainly depends on the
properties of the weight function that measures the distance between two objects
and on the objective function that evaluates the goodness of a given partition of
the space. Many exact and approximate clustering problems are known to be
hard to solve, in particular NP-hard [77,80]. Hence, unless P = NP, no
polynomial-time strategy can guarantee to find the optimal solution.

7.2 Principal component analysis 

Principal component analysis (PCA) [81,82] is a multivariate statistical method
used to reduce a multi-dimensional space to a lower dimension.

Given a set of correlated variables $X = \{X_1, X_2, \ldots, X_n\}$, the aim of PCA
is to find new artificial variables $Y = \{Y_1, Y_2, \ldots, Y_m\}$, with $m \leq n$, such that
(i) each new variable $Y_i$ is obtained as a linear combination of the original
variables, (ii) the new variables $Y_i$ are pairwise uncorrelated, (iii) the variance of
$Y_i$ decreases as the index $i$ increases, and (iv) the sum of the variance of the
new variables $Y_i$ is a significant portion of the sum of the variance of the
original variables $X_i$. The principal components $Y_i$ represent the most informative
orthogonal aspects of the data set.

A simple method to find principal components is the following:

1. find the covariance matrix $\Sigma_X$ of the variables in $X$;

2. compute the eigenvalues $\lambda_i$ of $\Sigma_X$ and sort them in descending order:
$\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_n \geq 0$;

3. find the eigenvectors $e_i$ associated with the eigenvalues $\lambda_i$. The i-th principal
component $Y_i$ is $\sum_{j=1}^{n} e_{i,j} X_j$, where $e_{i,j}$ is the j-th component of vector $e_i$;

4. determine the $m \leq n$ most informative principal components $Y_1, Y_2, \ldots, Y_m$
such that:

$$\frac{\lambda_1 + \lambda_2 + \ldots + \lambda_m}{\lambda_1 + \lambda_2 + \ldots + \lambda_n} \geq \alpha$$

where $0 < \alpha < 1$ is a threshold (often fixed at 0.8).

An alternative method (Kaiser method) to isolate the $m$ principal components
is to choose those components such that $\lambda_i > 1$. Notice that, since $\Sigma_X$ is
positive semi-definite, its eigenvalues are greater than or equal to 0. Moreover, it
holds that the variance $var(Y_i) = \lambda_i$ and
$\sum_{i=1}^{n} var(X_i) = tr(\Sigma_X) = \sum_{i=1}^{n} \lambda_i = \sum_{i=1}^{n} var(Y_i)$.
Hence, the most informative principal components contribute at
least a fraction $\alpha$ of the total variance of the original data set $X$. If variables
in $X$ have different units of measure, then they must be standardized before
applying the method. This is equivalent to working with the correlation matrix $R_X$
instead of the covariance matrix $\Sigma_X$.

The contribution of the original variable $X_j$ to the new variable $Y_i$ is given by
the eigenvector component $e_{i,j}$. The higher $e_{i,j}$ is in absolute value, the higher
the contribution of $X_j$ to $Y_i$. Moreover, it holds that the correlation between $Y_i$
and $X_j$ is $e_{i,j} \sqrt{\lambda_i} / sd(X_j)$, where $sd(X_j)$ is the standard deviation of $X_j$. Hence
the sign of $e_{i,j}$ gives the sign of the correlation between $Y_i$ and $X_j$. It turns out
that variables $X_j$ can be clustered by associating each $X_j$ to the component $Y_i$
such that the two variables are most correlated.
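
The four steps of the simple method can be sketched in plain Python. To stay
self-contained, the sketch finds eigenpairs by power iteration with deflation,
which is a swapped-in technique and only one of several possible
eigen-solvers; it assumes the starting vector is not orthogonal to the sought
eigenvector. The data set and all function names are invented for
illustration.

```python
import math


def covariance_matrix(rows):
    """Sample covariance matrix of the columns (variables) of rows."""
    n_obs, n_var = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n_obs for j in range(n_var)]
    return [[sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n_obs - 1)
             for j in range(n_var)] for i in range(n_var)]


def leading_eigenpair(M, iters=1000):
    """Power iteration: largest eigenvalue of symmetric M and a unit eigenvector."""
    n = len(M)
    v = [float(i + 1) for i in range(n)]   # assumed not orthogonal to the target
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iters):
        w = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    lam = sum(v[i] * sum(M[i][j] * v[j] for j in range(n)) for i in range(n))
    return lam, v


def principal_components(rows, alpha=0.8):
    """Return (eigenvalue, eigenvector) pairs explaining at least alpha of the variance."""
    M = covariance_matrix(rows)
    n = len(M)
    total = sum(M[i][i] for i in range(n))   # trace = sum of the eigenvalues
    pairs, explained = [], 0.0
    while explained / total < alpha and len(pairs) < n:
        lam, v = leading_eigenpair(M)
        pairs.append((lam, v))
        explained += lam
        # Deflation: remove the component just found from the matrix.
        M = [[M[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return pairs


# Hypothetical data: the second variable is twice the first, the third is unrelated.
data = [(1, 2, 0), (2, 4, 1), (3, 6, 1), (4, 8, 0)]
components = principal_components(data, alpha=0.8)
```

With this data a single component already explains 25/26 of the total
variance, so the threshold 0.8 selects one principal component.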

7.3 Information-theoretic abstractions 

Rosvall and Bergstrom [83] propose a model for resolving community structure 
in complex networks based on information theory. They start from the following 
observation: when we describe a network as a set of interconnected modules, we 
are highlighting certain regularities of the network's structure while filtering out 
the relatively unimportant details. Thus, a modular description of a network can 
be viewed as a lossy compression of that network's topology. The best maps are 
those that convey a great deal of information while requiring minimal bandwidth; 
i.e., they are good compressions. This view suggests that we can approach the 
challenge of identifying the community structure of a complex network as a 
problem in information theory. The authors envision the process of abstraction
of a complex network as a communication process. The link structure of the 
network is a random variable X; this is compressed into a simplified description 
Y which is sent through a noiseless communication channel. The receiver uses the 
abstraction Y to make guesses Z about the structure of the original network X. 
The partition of the original network is achieved by minimizing the length L(Y) 
of the abstraction Y plus the length L(X|Y) of the additional information that is
necessary to describe the original network X given its simplified representation 
Y . The minimization problem is tackled using the simulated annealing approach. 

In a subsequent work [84], the same authors use ergodic random walks on
complex directed weighted networks to reveal community structure. The intuition
here is as follows. The local interactions among the subunits of a network
system induce a system-wide flow of information that characterizes the behavior 
of the whole system. Consequently, if we want to understand how network struc- 
ture relates to system behavior, we need to understand the flow of information 
on the network. A group of nodes among which information flows quickly and 
easily can be aggregated and described as a single well connected module; the 
links between modules capture the avenues of information flow between those 
modules. The authors use an infinite random walk as a proxy of the informa- 
tion flow and identify the modules that compose the network by minimizing the 
expected description length of the ergodic random walk within and across the 
modules. This is the sum of the entropy of the movements across modules and 
of the entropy of movements within modules. Huffman coding [85] is exploited
to encode the random walk by assigning short codewords to frequently visited
nodes and modules, and longer codewords to rare ones, much as common words
are short in spoken language [B]. Shannon's source coding theorem [86] provides
a lower bound to the average length of a codeword. The minimization problem
is approached using a greedy search algorithm and the solution is refined with 
the aid of simulated annealing. 
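
The interplay between Huffman coding and the source coding bound can be
illustrated on a toy example. The sketch below is not the authors' method: it
only builds a Huffman code for a hypothetical vector of node visit frequencies
and compares the average codeword length with the Shannon entropy.

```python
import heapq
import math


def huffman_code_lengths(probs):
    """Return the Huffman codeword length of every symbol."""
    # Heap items: (probability, tie-breaker, symbols in the subtree).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    counter = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:
            lengths[s] += 1   # merged symbols move one level deeper in the tree
        heapq.heappush(heap, (p1 + p2, counter, s1 + s2))
        counter += 1
    return lengths


def entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


visit_freq = [0.4, 0.3, 0.2, 0.1]   # hypothetical visit frequencies of four nodes
lengths = huffman_code_lengths(visit_freq)
average_length = sum(p * l for p, l in zip(visit_freq, lengths))
print(lengths, average_length, entropy(visit_freq))
```

Frequently visited nodes receive shorter codewords, and the average length
stays above the entropy, as the source coding theorem requires.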



8 Quotations 



"Measuring is knowing" - Heike Kamerlingh Onnes 

"Not everything that can be counted counts, and not everything that 
counts can be counted" - Albert Einstein 

"If scientometrics is a mirror of science in action, then scientometri- 
cians' particular responsibility is to both polish the mirror and warn 
against optical illusions" - Michel Zitt 

"No amount of fancy statistical footwork will overcome basic inadequa- 
cies in either the appropriateness or the integrity of the data collected" 

- Goldstein and Spiegelhalter 

"We think of statistics as facts that we discover, not numbers we create" 

- Joel Best 

"Citations are frozen footprints in the landscape of scholarly achievement 
which bear witness of the passage of ideas" - Blaise Cronin 

"The use of a single index crashes the multidimensional space of biblio- 
metrics into one single dimension" - Wolfgang Glänzel

"For unto every one that hath shall be given, and he shall have abun- 
dance: but from him that hath not shall be taken away even that which 
he hath" - Jesus of Nazareth 